News-Oriented Keyword Indexing with Maximum Entropy Principle
Abstract
In the information era, keywords are valuable for information retrieval, text clustering, and related tasks, and news is a domain that consistently attracts a large amount of attention. Taking into account the characteristics of news documents and the resources available, this paper proposes using a Maximum Entropy (ME) model for automatic keyword indexing. ME-based keyword indexing hinges on two problems: obtaining all the candidate items and selecting useful features for the ME model. First, we use relatively mature linguistic techniques and tools to obtain all possible candidate items. Then we introduce a feature set for the ME model. Finally, we test the model and report experimental results.
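The abstract does not list the paper's features, tools, or training procedure, so the following is only a minimal sketch of the general idea: score each candidate item with a binary maximum entropy model (equivalent to logistic regression) and rank candidates by the probability of being a keyword. The feature set (relative frequency, headline membership, position of first occurrence), the toy article, the labels, and all function names are assumptions made for illustration, not the authors' method.

```python
import math

def extract_features(candidate, title_tokens, body_tokens):
    """Hypothetical features for one candidate; NOT the paper's feature set."""
    n = max(len(body_tokens), 1)
    count = body_tokens.count(candidate)
    first = body_tokens.index(candidate) if count else n
    return [
        1.0,                                        # bias term
        count / n,                                  # relative frequency in the body
        1.0 if candidate in title_tokens else 0.0,  # appears in the headline
        1.0 - first / n,                            # earlier first occurrence scores higher
    ]

def predict(weights, features):
    """P(keyword | features) under a binary maximum entropy (logistic) model."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def train(samples, labels, epochs=500, lr=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = predict(weights, x)
            for i, f in enumerate(x):
                weights[i] += lr * (y - p) * f
    return weights

def rank(candidates, weights, title_tokens, body_tokens):
    """Sort candidates from most to least likely keyword."""
    scored = [
        (predict(weights, extract_features(c, title_tokens, body_tokens)), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, reverse=True)]

# Toy article and hand-assigned labels (assumed data, for illustration only).
title_tokens = ["flood", "hits", "city"]
body_tokens = "the flood hit the city on monday and the flood damaged roads".split()
candidates = ["flood", "city", "monday", "roads"]
labels = [1, 1, 0, 0]

samples = [extract_features(c, title_tokens, body_tokens) for c in candidates]
weights = train(samples, labels)
ranked = rank(candidates, weights, title_tokens, body_tokens)
```

In this sketch the candidates are single tokens taken from a toy sentence; a real system would presumably obtain them with word segmentation and other linguistic tools, as the abstract indicates, and would train on many labelled articles rather than one.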
Similar resources
Keyword Spotting Using Durational Entropy
This paper deals with the task of detection of a given keyword in continuous speech. We build upon a previously proposed algorithm where a modified Viterbi search algorithm is used to detect keywords, without requiring any explicit garbage or filler models. In this work, the concept of durational entropy is used to further discard a large fraction of false alarm errors. Durational entropy is de...
News-Oriented Automatic Chinese Keyword Indexing
In our information era, keywords are very useful to information retrieval, text clustering and so on. News is always a domain attracting a large amount of attention. However, the majority of news articles come without keywords, and indexing them manually is costly. Aiming at news articles’ characteristics and the resources available, this paper introduces a simple procedure to index keywords...
An Information-Theoretic Framework for Semantic-Multimedia Indexing
To solve the problem of indexing collections with diverse text documents, image documents, or documents with both text and images, one needs to develop a model that supports heterogeneous types of documents. In this paper, we show how information theory supplies us with the tools necessary to develop a unique model for text, image, and text/image retrieval. In our approach, for each possible qu...
Work-in-Progress: Automated Named Entity Extraction for Tracking Censorship of Current Events
Tracking Internet censorship is challenging because what content the censors target can change daily, even hourly, with current events. The process must be automated because of the large amount of data that needs to be processed. Our focus in this paper is on automated probing of keyword-based Internet censorship, where natural language processing techniques are used to generate keywords to pro...
A statistical framework for fusing mid-level perceptual features in news story segmentation
News story segmentation is essential for video indexing, summarization and intelligence exploitation. In this paper, we present a general statistical framework, called exponential model or maximum entropy model, that can systematically select the most significant mid-level features of various types (visual, audio, and semantic) and learn the optimal ways in fusing their combinations in story se...